Measures for Corpus Similarity and Homogeneity
Abstract
How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similarity': human similarity judgements are not fine-grained enough, corpus similarity is inherently multidimensional, and similarity can only be interpreted in the light of corpus homogeneity. We then present an operational definition of corpus similarity which addresses or circumvents the problems, using purpose-built sets of "known-similarity corpora" (KSC). These KSC sets can be used to evaluate the measures. We evaluate the measures described in the literature, including three variants of the information-theoretic measure 'perplexity'. A χ²-based measure, using word frequencies, is shown to be the best of those tested.
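To illustrate what a χ²-based measure over word frequencies can look like, here is a minimal sketch. The abstract does not give the details, so the function name `chi_square_distance`, the `top_n` cutoff, and the whitespace tokenization are all assumptions made for illustration. The sketch treats both corpora as samples from one population, derives an expected count for each frequent word from the joint frequency list, and sums (observed − expected)² / expected over both corpora.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint corpus.

    Hypothetical helper for illustration; lower values suggest more similar corpora.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    joint = freq_a + freq_b          # word counts in the combined corpus
    total = size_a + size_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            # Expected count if the word were spread evenly by corpus size.
            expected = joint_count * size / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Toy usage with whitespace tokenization (an assumption, not from the abstract).
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()
score_same = chi_square_distance(corpus_a, corpus_a)
score_diff = chi_square_distance(corpus_a, corpus_b)
```

Applying the same function to two random halves of a single corpus would give a rough homogeneity baseline against which a between-corpus score could be read, in line with the abstract's point that similarity is only interpretable in the light of homogeneity.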
Similar resources
Measuring the homogeneity and similarity of language corpora
Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating big corpora is no longer difficult and the technology to analyze them is growing faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need metho...
ITRI-98-07 Measures for corpus similarity and homogeneity
How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by ‘corpus similarity’: human similarity judgements are not finegrained enough, corpus similarity is inherently multidimensional, and similarity ...
The influence of example-data homogeneity on EBMT quality
Homogeneity of large corpora is still a largely unclear notion. In this study we first make a link between the notions of similarity and homogeneity: a large corpus is made of sets of documents to which may be assigned a similarity score defined by cross-entropic measures, such similarity being implicitly expressed in the data. The distribution of the similarity scores of such subcorpora ma...
A Method to Quantify Corpus Similarity and its Application to Quantifying the Degree of Literality in a Document
Comparing and quantifying corpora is a key issue in corpus-based translation and corpus linguistics, for which there is still a notable lack of measures. This makes it difficult for a user to isolate, transpose, or extend the interesting features of a corpus to other NLP systems. In this work we address the issue of measuring similarity between corpora. We suggest a scale between two user chose...
ITRI-97-07 Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora
How similar are two corpora? A measure of corpus similarity would be very useful for language engineering. Word frequency lists are cheap and easy to generate so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus related to existing resources, or how easy it might be to port an NLP system designed to work with one t...